OLA-RAW: Scalable Exploration over Raw Data
Authors
Abstract
In-situ processing has been proposed as a novel data exploration solution in many domains that generate massive amounts of raw data, e.g., astronomy, since it provides immediate SQL querying over raw files. The performance of in-situ processing across a query workload is, however, limited by the speed of fully scanning, tokenizing, and parsing the entire data. Online aggregation (OLA) has been introduced as an efficient method for data exploration that identifies uninteresting patterns faster by continuously estimating the result of a computation during the actual processing: the computation can be stopped as soon as the estimate is accurate enough for the pattern to be deemed uninteresting. However, existing OLA solutions have a high upfront cost of randomly shuffling and/or sampling the data. In this paper, we present OLA-RAW, a bi-level sampling scheme for parallel online aggregation over raw data. Sampling in OLA-RAW is query-driven and performed exclusively in-situ during query execution, without data reorganization. This is realized by a novel resource-aware bi-level sampling algorithm that processes data in random chunks concurrently and adaptively determines the number of sampled tuples inside a chunk. In order to avoid the cost of repetitive conversion from raw data, OLA-RAW incrementally builds and maintains a memory-resident bi-level sample synopsis. We implement OLA-RAW inside a modern in-situ data processing system and evaluate its performance across several real and synthetic datasets and file formats. Our results show that OLA-RAW chooses the sampling plan that minimizes the execution time and guarantees the required accuracy for each query in a given workload. The end result is a focused data exploration process that avoids unnecessary work and discards uninteresting data.
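To make the idea of bi-level sampling concrete, the following is a minimal sketch of a textbook two-stage (chunk-level, then tuple-level) estimator for a SUM aggregate. It is not the OLA-RAW algorithm: the function name `bilevel_sum_estimate`, its parameters, and the fixed per-chunk sample size are assumptions made for illustration, whereas the abstract describes a resource-aware scheme that chooses the number of sampled tuples per chunk adaptively and processes chunks concurrently.

```python
import random


def bilevel_sum_estimate(chunks, n_chunks, tuples_per_chunk, seed=0):
    """Estimate the SUM over all chunks from a two-level random sample.

    Hypothetical helper for illustration only (not from the paper).
    chunks           -- list of lists of numbers, one inner list per chunk
    n_chunks         -- number of chunks to sample (level 1)
    tuples_per_chunk -- number of tuples to sample in each chosen chunk (level 2)
    """
    rng = random.Random(seed)
    total_chunks = len(chunks)
    # Level 1: pick chunks uniformly at random, without replacement.
    sampled_ids = rng.sample(range(total_chunks), n_chunks)

    estimate = 0.0
    for chunk_id in sampled_ids:
        chunk = chunks[chunk_id]
        if not chunk:
            continue  # an empty chunk contributes nothing to the sum
        # Level 2: pick tuples uniformly at random inside the chunk.
        m = min(tuples_per_chunk, len(chunk))
        tuples = rng.sample(chunk, m)
        # Scale the tuple-level sample sum up to the full chunk size.
        estimate += len(chunk) * sum(tuples) / m

    # Scale the chunk-level sample up to the full number of chunks.
    return total_chunks / n_chunks * estimate


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print("exact sum:", sum(sum(c) for c in data))
    print("estimate :", bilevel_sum_estimate(data, n_chunks=2, tuples_per_chunk=2))
```

When every chunk and every tuple is sampled, the estimate coincides with the exact sum; with smaller samples it trades accuracy for less raw-data conversion, which is the trade-off an online aggregation system has to manage.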
Similar resources
Scalable In-Situ Exploration over Raw Data
Application. The Palomar Transient Factory (PTF) project aims to identify and automatically classify transient astrophysical objects such as variable stars and supernovae in real-time. A list of candidates is extracted from the images taken by the telescope during a night. They are stored as a table in one or more FITS files. The initial stage in the identification process is to execute a serie...
Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models
Achieving efficient and scalable exploration in complex domains poses a major challenge in reinforcement learning. While Bayesian and PAC-MDP approaches to the exploration problem offer strong formal guarantees, they are often impractical in higher dimensions due to their reliance on enumerating the state-action space. Hence, exploration in complex domains is often performed with simple epsilon...
Adaptive partitioning and indexing for raw data querying
Traditional database management systems approach to data analytics assumes that the input would be loaded within the DBMS, and then queried upon. However, data analytics depend on the interaction with the data analyst and as data collections grow larger and larger, data loading acts as a bottleneck and it incurs significant data-to-query delay. In this paper, we examine the NoDB paradigm, which...
Signing the Unsigned: Robust Surface Reconstruction from Raw Pointsets
We propose a modular framework for robust 3D reconstruction from unorganized, unoriented, noisy, and outlier-ridden geometric data. We gain robustness and scalability over previous methods through an unsigned distance approximation to the input data followed by a global stochastic signing of the function. An isosurface reconstruction is finally deduced via a sparse linear solve. We show with exp...
Vide: an editor for the visual exploration of raw data
The analysis of binary data remains a challenge, especially for large or potentially inconsistent files. Traditionally, hex editors only make limited use of semantic information available to the user. We present an editor that supports user-supplied semantic data definitions. This semantic information is used throughout the program to realize semantic data visualization and data exploration cap...
Journal title:
- CoRR
Volume abs/1702.00358
Publication date 2017